The primary goal of our project is to determine if we can closely predict movie ratings before critic ratings are released and the films are given an average user score on IMDb. Being able to determine whether or not a movie will be a quality release without critic ratings could address a number of business problems that surround the financing and marketing strategies for different types of movies. Film studios and movie theatres could use these insights to budget accordingly and properly advertise movies to maximize popularity, ratings, and profits. The dataset we will be using contains 5,043 movies, and 28 variables related to the movie.
The primary research questions we wish to ask revolve around predicting a movie’s IMDb score. One of which is will the number of faces in a movie poster correlate with the movie’s rating? Can the multiple actor variables and the director variable be useful in predicting a movie’s IMDb score? Does a movie’s parental guidance rating (meaning R, PG-13, etc.) play into how well a movie scores? Does a movie’s budget help determine a movie’s critical success? Do certain genres tend to receive higher critic scores?
Our data was retrieved as a CSV file from Kaggle:
https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
The data was retrieved by scraping 28 values for 5043 movies from the IMDb website and by using facial recognition software run on 4906 posters. The movies’ release dates span over 100 years and the movies originate from 66 different countries. The variables of the dataset are as follows:
Some of our data will need to be cleansed before moving on to analysis. The variable “director_facebook_likes” is inaccurate in many instances. Foreign films using currencies different from USD may have a misrepresented gross revenue, because the values aren’t equivalent with foreign currencies. This can be addressed by transforming these values by an exchange rate. We would also like to retrieve the release date for each film, since that variable was not included in the original dataset.
The goal of our project is to find connections between variables and the film’s IMDb score. We believe that the key variables are the following: facenumber_in_poster, director_name, actor_1_name, actor_2_name, and actor_3_name. Our hypothesis is that these variables will be crucial in predicting a movie’s overall critic score. We also hypothesize that budget, gross, actors, directors, genre and content rating will play into how much a movie grosses in revenue.
Clear the environment
rm(list = ls())
Load piping Package
# Note: User may need install packages if they have not previously used them
library(magrittr)
Load Data Manipulation Packages
library(plyr)
library(dplyr)
Load movie CSV file in as a data frame
movie_df <- read.csv("movie_metadataCSV.csv", header = TRUE)
# Change blank spaces to NA
movie_df[movie_df==""] <- NA
show structure of movie_df
print(str(movie_df))
## 'data.frame': 5043 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 NA 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
## $ movie_title : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## NULL
Most of the variables for the IMDb data set were formatted correctly to begin with, but some cleansing was required.
See if there are any observations with NA values for title or score making analysis useles
count(movie_df[!(is.na(movie_df$movie_title)) & !(is.na(movie_df$imdb_score)), ])
## # A tibble: 1 x 1
## n
## <int>
## 1 5043
It appears that every observation has a score and title
Convert categorical variables into factors to improve analyses
Color is a factor but is currently a character type, need to change it to a number. 1 = “Color”, 0 = “Black and White”
movie_df$color <- ifelse(movie_df$color == "Color", 1, 0)
movie_df$color <- as.factor(movie_df$color)
Convert number of faces in poster to a factor
movie_df$facenumber_in_poster <- as.factor(movie_df$facenumber_in_poster)
movie_df$title_year <- as.factor(movie_df$title_year)
We created a new variable that calculated the gross profit of a film by subtracting the budget of the film from the gross revenue. This will allow us to perform analyses later on and see how other factors may affect the profit of movies.
create new variable gross_profit by subtracting budget from gross
# Create new vector, positive means budget was lower than gross, negative means that budget exceeds gross
movie_df$gross_profit <- movie_df$gross - movie_df$budget
The values of the “movie_title” variable in our data set had the character “” appended at the end and white space at the end, so we removed the character and trimmed the white space for better readability and to prevent any issues that might have arisen without eliminating the extra white space.
# Remove the  at the end of the titles
movie_df$movie_title <- gsub("Â", "", movie_df$movie_title)
# trim whitespace
movie_df$movie_title <- trimws(movie_df$movie_title)
Check to see if there are duplicated movie titles
# Show number of unique titles
length(unique(movie_df$movie_title))
## [1] 4916
# Show number of duplicated titles
anyDuplicated(movie_df$movie_title)
## [1] 138
We need to remove duplicates to avoid skew in our findings, but cannot base it just on titles as some movies have remakes. We are chose to remove films that have the same title, director, and release year, as this means that it is likeley they are duplicates
# statement will remove movies if they have duplicate titles, directors, release years and release years
movie_df <- movie_df %>% distinct(movie_title, director_name, title_year, .keep_all = T)
It appears that there are 4919 unique movies; movies with unique titles, directors and release years
Show updated structure of data set
str(movie_df)
## 'data.frame': 4919 obs. of 29 variables:
## $ color : Factor w/ 2 levels "0","1": 2 2 2 2 NA 2 2 2 2 2 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
## $ facenumber_in_poster : Factor w/ 19 levels "0","1","2","3",..: 1 1 2 1 1 2 1 2 5 4 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : Factor w/ 91 levels "1916","1920",..: 84 82 90 87 NA 87 82 85 90 84 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## $ gross_profit : num 523505847 9404152 -44925825 198130642 NA ...
Show summary statistics now that the dataset has been cleansed
Load stargazer to improve appearance of summary statistics
library(stargazer)
stargazer(movie_df, type = "text", title = "Movie Data frame Summary Statistics")
##
## Movie Data frame Summary Statistics
## =============================================================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------------------------------------------------
## num_critic_for_reviews 4,870 138.010 120.254 1 813
## duration 4,904 107.089 25.279 7 511
## director_facebook_likes 4,817 690.705 2,832.107 0 23,000
## actor_3_facebook_likes 4,896 630.961 1,625.429 0 23,000
## actor_1_facebook_likes 4,912 6,490.876 15,103.080 0 640,000
## gross 4,056 47,621,564.000 67,363,869.000 162 760,505,847
## num_voted_users 4,919 82,618.200 138,285.800 5 1,689,764
## cast_total_facebook_likes 4,919 9,574.646 18,159.990 0 656,730
## num_user_for_reviews 4,898 267.604 372.839 1 5,060
## budget 4,434 39,288,218.000 208,557,941.000 218 12,215,500,000
## actor_2_facebook_likes 4,906 1,621.072 4,010.223 0 137,000
## imdb_score 4,919 6.438 1.128 1.600 9.500
## aspect_ratio 4,593 2.222 1.403 1.180 16.000
## movie_facebook_likes 4,919 7,345.294 19,200.710 0 349,000
## gross_profit 3,790 5,468,410.000 228,898,340.000 -12,213,298,588 523,505,847
## ---------------------------------------------------------------------------------------------
# Stargazer only shows statistics for non categorical and character types
Load plotly visualization package
# Note: plotly package graphs are interactive
library(plotly)
library(ggplot2)
Show the distribution of movie scores with a kernal density plot
# Kernel density plot
score_density <- density(movie_df$imdb_score) # Calculate kernel density estimate
plot(score_density, type = "n", main = "Kernel Density Estimate of Movie Scores")
polygon(score_density, col = "wheat")
The kernal density plot shows the density of each movie score. We can see that the most common score range is from approximately 6 to 6.5. This means that most films released will receive a score close to a 6. We need to utilize more visualization techniques to discover what causes films to reach exceptional scores to go above and beyond the average.
We can look to see which years tend to have higher quality movies based on average IMDb scores of movies over time.
# Create data frame with year and average score
yearly_score <- ddply(movie_df, c("title_year"), summarise,
mean(imdb_score, na.rm = TRUE))
colnames(yearly_score) <- c("title_year", "avg_score")
# Plotly line graph that shows average imdb scores across the years
plot_ly(yearly_score, x = yearly_score$title_year, y = yearly_score$avg_score, type = "scatter", mode = "lines") %>%
layout(title = "Year and Score")
Up until the 1970s, only a few movies may be registered in the IMDb database for a given year, with many years only yielding a single movie. This leads to large variances between years up until the 1970s, such as how a single movie in 1920 is completely representative of the year’s extremely low average movie IMDb score of 4.8, only to surge the exisitng data point because a single movie performed better in 1925 and received a rating of 8.3. Regardless, the general trend appears to be a declining average IMDb score over time. This may not reflect the quality of large production films, but rather reduced barriers to entry have enabled a surge in indie films to enter the market and the fact that the IMDb database can also register these films with the rise of internet-based titles being published on platforms like YouTube.
Show number of movies released in each year
Create df with number year and movies in that year
movie_yearly_count <- movie_df %>%
group_by(title_year) %>%
summarise(yearly_total = sum(length(title_year)))
Verify sum of number of movies in the data frame
sum(movie_yearly_count$yearly_total)
## [1] 4919
Plotly line graph that shows number of films released by year in the data set
plot_ly(movie_yearly_count,
x = movie_yearly_count$title_year, y = movie_yearly_count$yearly_total, type = "scatter", mode = "lines") %>%
layout(title = "Total films released by year")
# Plot skips some years along the x-axis, but this was for spacing; the interacting with the chart will show count for every year
This graph shows the total films released by year included in the data set. There are way more movies released than this throughout these years, but the data set only includes a few thousand movie observations. The older years may have biased average scores, because there are so few entries included in the data set for that year.
We can explore which actors have the highest average movie score overall.
# Find actors with highest average movie score
actor_rating_df <- ddply(movie_df, c("actor_1_name"), summarise, # Pull out actor 1 and rating data
actor_m <- mean(imdb_score, na.rm = TRUE), # Find the mean of actors and discount na values
number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(actor_rating_df) <- c("actor_name", "avg_score", "num_of_movies") # Rename headers
actor_avg <- actor_rating_df[which(actor_rating_df$num_of_movies >= 7),]# Actors in at least 7 movies to avoid skew
actor_avg <- actor_avg[which(actor_avg$avg_score >= 7.15),] # Show only highest scoring actors for readability
# Convert actor_name to a factor
actor_avg$actor_name <- factor(actor_avg$actor_name)
# Show highest to lowest average score
actor_avg$actor_name <- reorder(actor_avg$actor_name, actor_avg$avg_score)
# Visualization of highest scoring actors
ggplot(actor_avg, aes(x = avg_score, y = actor_name)) + # Sets x to average score and y to the name
geom_point(aes(colour = actor_name)) + # Makes a point graph with different colors per actor
theme(axis.text = element_text(size = 10)) + # Sets font size
ggtitle("Actor Scores") + # Changes chart title
xlab("Average Score") + # Changes x field title
ylab("Actor Name") # Changes y field table
We can see a visual representation of actors with the highest average IMDb ratings for the films they star in, with the top 3 being Leonardo DiCaprio, Tom Hanks, and Clint Eastwood. Predictions can be made for movie score utilizing these top scoring actors. This is implemented later on in this report.
We can explore which directors have the highest average movie score overall.
# Find actors with highest average movie score
dir_rating_df <- ddply(movie_df, c("director_name"), summarise, # Pull out director name and rating data
dir_m <- mean(imdb_score, na.rm = TRUE), # Find the mean of directors and discount na values
number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(dir_rating_df) <- c("director_name", "avg_score", "num_of_movies") # Rename headers
dir_avg <- dir_rating_df[which(dir_rating_df$num_of_movies >= 7),]# Direct at least 7 movies to avoid skew
dir_avg <- dir_avg[which(dir_avg$avg_score >= 7.5),] # Show only highest scoring directors for readability
dir_avg <- dir_avg[which(dir_avg$director_name != ''),] # Exclude movies who have no director name
# Convert actor_name to a factor
dir_avg$director_name <- factor(dir_avg$director_name)
# Show highest to lowest average score
dir_avg$director_name <- reorder(dir_avg$director_name, dir_avg$avg_score)
# Visualization of highest scoring actors
ggplot(dir_avg, aes(x = avg_score, y = director_name)) + # Sets x to average score and y to the name
geom_point(aes(colour = director_name)) + # Makes a point graph with different colors per director
theme(axis.text = element_text(size = 10, colour = 'black')) + # Sets font size
ggtitle("Director Scores") + # Changes chart title
xlab("Average Score") + # Changes x field title
ylab("Director Name") # Changes y field table
We can see a visual representation of directors with the highest average IMDb ratings for the films they star in, with the top 3 being Christopher Nolan, Quentin Tarantino, and James Cameron. Predictions can be made for movie score utilizing these top scoring directors. This is implemented later on in this report.
We can explore the average IMDb score for each content rating.
content_df <- ddply(movie_df, c("content_rating"), summarise, # New DF w/ rating, avg score and # of movies
mean(imdb_score, na.rm = TRUE), # Creates column containing mean scores
length(na.omit(imdb_score))) # Omits scores with NA values
colnames(content_df) <- c("content_rating", "avg_score", "number_of_films") # Rename columns
content_df <- content_df[which(content_df$content_rating != ""),] # Eliminate films from analysis without rating
# Visualization of average score by content rating
ggplot(content_df, aes(x = content_rating, y = avg_score)) + # Bar graph with x as rating and y as score
geom_bar(stat = "identity", width = .5, fill = 'deepskyblue3') + # Creates space between bars and changes color
theme(axis.text = element_text(size = 8, colour = 'black', angle = 45)) + # Sets text size to 8, text color to black, and angles text for separation purposes
ggtitle("Content Scores") + # Title of graph
xlab("Content Rating") + # Title of x axis
ylab("IMDB Score") # Title of y axis
Based on this plot, we can see the highest performing category of movies based on content rating is TV-MA (intended for a mature audience), with the lowest performing category being PG-13.This could possibly be the effect of restricted possibilities of movies that target a PG-13 audience not being as favored by mature audiences who watch both TV-MA movies and PG-13 movies.
We can show the spread of scores per content rating category with box plots.
# Create data frame that excludes movies with no content rating
content_df2 <- movie_df[which(movie_df$content_rating != ''), ]
#Use plotly for great boxplot capabilities, note: Might need to install "plotly"
plot_ly(content_df2, x = content_df2$imdb_score, color = content_df2$content_rating, type = "box") %>%
layout(title = "Movie Type and Score Range")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
These boxplots show the range film scores for each content rating. These outliers should not be removed because the values are legitimate. It is very possible for movie scores to achieve scores outside the boxplot range; either higher or lower.
Explore Average IMDB rating by number of faces in poster
# Create df with face numbers in poster and respective average score
face_df <- movie_df %>%
group_by(face_count = facenumber_in_poster) %>%
summarise(avg_score = mean(imdb_score)) %>%
na.omit() # Omit NA values
# Bar Graph of data
ggplot(face_df, aes(x = face_count, y = avg_score)) +
geom_bar(stat = "identity", width = .5, fill = "deepskyblue3") +
xlab("Number of Faces") +
ylab("Average Score") +
ggtitle("Average Score by faces in poster") +
theme(axis.text = element_text(size = 8, angle = 0))
The average score by faces in poster column graph shows whether having less faces in poster or more affects average score. It does not appear there is a strong relationship between the variables. While the poster with 43 faces has the highest average score, it is because there is only one movie with that the score so it is biased. The rest of the number of faces in posters seems to be fairly consistent throughout minus a few low-points.
Show 5 number summary for average score based on color using boxplot method
# Create Data Frame with average score and color
color_film <- movie_df %>%
select(color, imdb_score) %>% na.omit()
ggplot(color_film, aes(x = color, y = imdb_score, group = color, fill = color)) +
geom_boxplot() +
xlab("Color Type") +
ylab("Score") +
ggtitle("Black and White vs Color Film Scores")
On average colored films tend to score a higher IMDb. We decided to keep the outliers because they are valid data.
Visualize how language affects movie score
language_df <- movie_df %>%
group_by(language) %>%
summarise(avg_score = mean(imdb_score)) %>% na.omit()
ggplot(language_df, aes(x = language, y = avg_score)) +
geom_bar(stat = "Identity", width = .5, fill = "deepskyblue3") +
theme(axis.text = element_text(size = 10, angle = 90)) +
xlab("Original Language") +
ylab("Average Score") +
ggtitle("Language and Score")
Based on the chart the language that scored the highest IMDb score is the Telungu language. This is not very representative because of the small number of movies that are actually in that language. It would not be a fair assumption to say that a language that has a small amount of movies and a high IMDb score would always get a higher score than a language that has more movies.
# Load in world map from map_data
world_countries <- map_data("world2")
Visualize how the country the film was released in affects average score
ScoreByCountry_df <- movie_df %>%
group_by(country) %>%
summarise(avg_score = mean(imdb_score)) %>% na.omit()
score_plot <- ggplot()
score_plot <- score_plot + geom_map(data = world_countries, map = world_countries,
aes(x = long, y = lat, map_id = region),
fill="#ffffff", color="#ffffff", size=0.15)
## Warning: Ignoring unknown aesthetics: x, y
score_plot <- score_plot + geom_map(data = ScoreByCountry_df, map = world_countries,
aes(fill = avg_score, map_id = country),
color="#ffffff", size=0.15)
score_plot <- score_plot + labs(fill = "Average IMDb Score",
title = "Average IMDb Score by Country",
x="", y="")
score_plot <- score_plot + theme(panel.border = element_blank())
#score_plot <- score_plot + theme(panel.background = element_blank())
score_plot <- score_plot + theme(axis.ticks = element_blank())
score_plot <- score_plot + theme(axis.text = element_blank())
score_plot
Based on the chart all countries have around the same score. There is some bias with the African nations who have a small amount of films compared to the other countries.
Investigate to see if number of movie Facebook likes and content rating are indicators of score
#movie_likes <- movie_df %>% select(movie_facebook_likes, imdb_score, content_rating)
plot_ly(movie_df, x = movie_df$movie_facebook_likes, y = movie_df$imdb_score,
color =movie_df$content_rating , mode = "markers",text=paste('Content:',movie_df$content_rating))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
It appears that all the movie ratings tend to score around the same score. It also appears that the R movie rating is the most commonly rated movie.
Clear out environment for information management purposes
rm(list = ls()[!ls() %in% c("movie_df")])
We chose to use Pearson Correlation, because the measures are not ordinal, and measures are linear
Pearson Correlation Table 1
Load Corrplot
library(corrplot)
Create data frame with only numeric types to see how numeric variables correlate to IMDb score
numeric_df <- select_if(movie_df, is.numeric) # Other option cor(movie_df[sapply(movie_df, is.numeric)])
# Creates a correlation table showing variables correlate w/ each other by number and text color
num_correlation <- round(cor(numeric_df, use = "pairwise.complete.obs", method = "pearson"),1)
corrplot(num_correlation, method = "number", type = )
# Note: After running this chunk, you can also select hte num_correlation item from the global environment to view a generic table
Pearson Correlation Table 2, Visualized with Circles
# Creates a correlation table showing variables correlate w/ each other by color and circle size
corrplot(num_correlation, method = "circle")
It appears that most variables have a very small correlation with each other. Some exceptions are the actor_1_facebook_likes and cast_total_facebook_likes as well as budget and gross_profit. This might be because if there are a lot of likes on the lead actor then the rest of the cast will be liked as well. Since there is a low correlation we may not be able to predict the scores very well.
Clear out global environment except for movie_df for information management purposes
rm(list = ls()[!ls() %in% c("movie_df")])
Create data frame to be used for prediction
str(movie_df)
## 'data.frame': 4919 obs. of 29 variables:
## $ color : Factor w/ 2 levels "0","1": 2 2 2 2 NA 2 2 2 2 2 ...
## $ director_name : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
## $ facenumber_in_poster : Factor w/ 19 levels "0","1","2","3",..: 1 1 2 1 1 2 1 2 5 4 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : Factor w/ 91 levels "1916","1920",..: 84 82 90 87 NA 87 82 85 90 84 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## $ gross_profit : num 523505847 9404152 -44925825 198130642 NA ...
Eliminate NA values
movie_df <- movie_df %>% na.omit()
Create factors of highest scoring actors and directors
movie_df$dicaprio <- ifelse(movie_df$actor_1_name == "Leonardo DiCaprio", 1, 0)
movie_df$hanks <- ifelse(movie_df$actor_1_name == "Tom Hanks", 1, 0)
movie_df$eastwood <- ifelse(movie_df$actor_1_name == "Clint Eastwood", 1, 0)
movie_df$hardy <- ifelse(movie_df$actor_1_name == "Tom Hardy", 1, 0)
movie_df$rickman <- ifelse(movie_df$actor_1_name == "Alan Rickman", 1, 0)
movie_df$cumberbatch <- ifelse(movie_df$actor_1_name == "Benedict Cumberbatch", 1, 0)
movie_df$bale <- ifelse(movie_df$actor_1_name == "Christian Bale", 1, 0)
movie_df$hoffman <- ifelse(movie_df$actor_1_name == "Philip Seymour Hoffman", 1, 0)
movie_df$driver <- ifelse(movie_df$actor_1_name == "Minni Driver", 1, 0)
movie_df$ford <- ifelse(movie_df$actor_1_name == "Harrison Ford", 1, 0)
movie_df$spacy <- ifelse(movie_df$actor_1_name == "Kevin Spacey", 1, 0)
movie_df$nolan <- ifelse(movie_df$director_name == "Christopher Nolan", 1, 0)
movie_df$tarantino <- ifelse(movie_df$director_name == "Quentin Tarantino", 1, 0)
movie_df$cameron <- ifelse(movie_df$director_name == "James Cameron", 1, 0)
movie_df$jackson <- ifelse(movie_df$director_name == "Peter Jackson", 1, 0)
movie_df$fincher <- ifelse(movie_df$director_name == "David Fincher", 1, 0)
movie_df$scorsese <- ifelse(movie_df$director_name == "Martin Scorsese", 1, 0)
movie_df$anderson <- ifelse(movie_df$director_name == "Wes Anderson", 1, 0)
movie_df$Greengrass <- ifelse(movie_df$director_name == "Paul Greengrass", 1, 0)
Build multiple linear regression model to see how the significance of variables on IMDB Score
Load MASS Package to fit model
library(MASS)
Regress IMDB score on all other variables and use step method to determine which model to use
# Fit the model
score_fit <- lm(imdb_score ~ + country + content_rating + color + title_year + language +
facenumber_in_poster + gross + dicaprio + hanks + eastwood + hardy +
rickman + cumberbatch + bale + hoffman + driver + ford + spacy + nolan +
tarantino + cameron + jackson + fincher + scorsese + anderson + Greengrass,
data = movie_df)
# Perform stepwise model seleciton
score_step <- stepAIC(score_fit, direction = "both")
## Start: AIC=-356.34
## imdb_score ~ +country + content_rating + color + title_year +
## language + facenumber_in_poster + gross + dicaprio + hanks +
## eastwood + hardy + rickman + cumberbatch + bale + hoffman +
## driver + ford + spacy + nolan + tarantino + cameron + jackson +
## fincher + scorsese + anderson + Greengrass
##
##
## Step: AIC=-356.34
## imdb_score ~ country + content_rating + color + title_year +
## language + facenumber_in_poster + gross + dicaprio + hanks +
## eastwood + hardy + rickman + cumberbatch + bale + hoffman +
## ford + spacy + nolan + tarantino + cameron + jackson + fincher +
## scorsese + anderson + Greengrass
##
## Df Sum of Sq RSS AIC
## - title_year 73 119.102 3100.0 -359.10
## - facenumber_in_poster 18 27.612 3008.5 -358.63
## - cameron 1 0.534 2981.4 -357.68
## <none> 2980.9 -356.34
## - jackson 1 2.238 2983.2 -355.59
## - cumberbatch 1 2.416 2983.3 -355.37
## - hardy 1 2.430 2983.3 -355.36
## - eastwood 1 3.270 2984.2 -354.33
## - ford 1 3.464 2984.4 -354.09
## - rickman 1 3.927 2984.8 -353.52
## - dicaprio 1 3.968 2984.9 -353.47
## - Greengrass 1 4.028 2984.9 -353.40
## - bale 1 5.271 2986.2 -351.88
## - nolan 1 8.538 2989.5 -347.88
## - hoffman 1 9.999 2990.9 -346.09
## - scorsese 1 11.535 2992.4 -344.22
## - spacy 1 11.666 2992.6 -344.05
## - fincher 1 12.452 2993.4 -343.10
## - anderson 1 12.532 2993.4 -343.00
## - tarantino 1 13.927 2994.8 -341.29
## - hanks 1 15.530 2996.4 -339.34
## - color 1 24.448 3005.4 -328.47
## - language 28 105.107 3086.0 -285.65
## - country 39 124.203 3105.1 -285.09
## - content_rating 11 171.818 3152.7 -173.46
## - gross 1 264.966 3245.9 -47.00
##
## Step: AIC=-359.1
## imdb_score ~ country + content_rating + color + language + facenumber_in_poster +
## gross + dicaprio + hanks + eastwood + hardy + rickman + cumberbatch +
## bale + hoffman + ford + spacy + nolan + tarantino + cameron +
## jackson + fincher + scorsese + anderson + Greengrass
##
## Df Sum of Sq RSS AIC
## - facenumber_in_poster 18 26.879 3126.9 -363.54
## - cameron 1 1.051 3101.1 -359.86
## <none> 3100.0 -359.10
## - jackson 1 1.795 3101.8 -358.99
## - hardy 1 2.134 3102.1 -358.59
## - cumberbatch 1 2.874 3102.9 -357.71
## - dicaprio 1 2.877 3102.9 -357.71
## + title_year 73 119.102 2980.9 -356.34
## - Greengrass 1 4.148 3104.2 -356.21
## - rickman 1 4.181 3104.2 -356.17
## - bale 1 4.870 3104.9 -355.36
## - nolan 1 7.985 3108.0 -351.70
## - eastwood 1 8.039 3108.1 -351.63
## - ford 1 8.369 3108.4 -351.25
## - hoffman 1 9.059 3109.1 -350.43
## - spacy 1 10.659 3110.7 -348.55
## - anderson 1 11.951 3112.0 -347.04
## - fincher 1 12.590 3112.6 -346.28
## - tarantino 1 14.155 3114.2 -344.45
## - scorsese 1 15.199 3115.2 -343.22
## - hanks 1 15.391 3115.4 -343.00
## - color 1 33.939 3134.0 -321.29
## - language 28 101.592 3201.6 -297.21
## - country 39 137.271 3237.3 -278.69
## - content_rating 11 189.467 3289.5 -164.22
## - gross 1 283.669 3383.7 -40.99
##
## Step: AIC=-363.54
## imdb_score ~ country + content_rating + color + language + gross +
## dicaprio + hanks + eastwood + hardy + rickman + cumberbatch +
## bale + hoffman + ford + spacy + nolan + tarantino + cameron +
## jackson + fincher + scorsese + anderson + Greengrass
##
## Df Sum of Sq RSS AIC
## - cameron 1 1.203 3128.1 -364.13
## <none> 3126.9 -363.54
## - jackson 1 1.729 3128.6 -363.52
## - hardy 1 2.042 3128.9 -363.15
## - dicaprio 1 2.938 3129.8 -362.11
## - cumberbatch 1 3.054 3129.9 -361.97
## - rickman 1 4.184 3131.1 -360.65
## - Greengrass 1 4.478 3131.4 -360.31
## - bale 1 5.034 3131.9 -359.66
## + facenumber_in_poster 18 26.879 3100.0 -359.10
## + title_year 73 118.370 3008.5 -358.63
## - nolan 1 8.428 3135.3 -355.70
## - ford 1 8.501 3135.4 -355.61
## - eastwood 1 8.581 3135.5 -355.52
## - hoffman 1 9.672 3136.6 -354.25
## - spacy 1 9.881 3136.8 -354.00
## - anderson 1 12.058 3139.0 -351.47
## - fincher 1 13.057 3140.0 -350.30
## - tarantino 1 14.276 3141.2 -348.89
## - hanks 1 15.199 3142.1 -347.81
## - scorsese 1 15.511 3142.4 -347.45
## - color 1 35.102 3162.0 -324.73
## - language 28 103.668 3230.6 -300.30
## - country 39 140.741 3267.6 -280.58
## - content_rating 11 192.543 3319.4 -167.08
## - gross 1 287.558 3414.5 -43.90
##
## Step: AIC=-364.13
## imdb_score ~ country + content_rating + color + language + gross +
## dicaprio + hanks + eastwood + hardy + rickman + cumberbatch +
## bale + hoffman + ford + spacy + nolan + tarantino + jackson +
## fincher + scorsese + anderson + Greengrass
##
## Df Sum of Sq RSS AIC
## - jackson 1 1.672 3129.8 -364.18
## <none> 3128.1 -364.13
## - hardy 1 2.027 3130.1 -363.77
## + cameron 1 1.203 3126.9 -363.54
## - cumberbatch 1 3.032 3131.1 -362.59
## - dicaprio 1 3.246 3131.3 -362.34
## - rickman 1 4.161 3132.3 -361.27
## - Greengrass 1 4.445 3132.5 -360.94
## - bale 1 5.026 3133.1 -360.26
## + title_year 73 118.978 3009.1 -359.90
## + facenumber_in_poster 18 27.030 3101.1 -359.86
## - nolan 1 8.270 3136.4 -356.48
## - ford 1 8.387 3136.5 -356.34
## - eastwood 1 8.555 3136.7 -356.15
## - hoffman 1 9.626 3137.7 -354.90
## - spacy 1 9.835 3137.9 -354.66
## - anderson 1 12.054 3140.2 -352.07
## - fincher 1 12.990 3141.1 -350.98
## - tarantino 1 14.166 3142.3 -349.61
## - hanks 1 15.072 3143.2 -348.56
## - scorsese 1 15.295 3143.4 -348.30
## - color 1 35.057 3163.2 -325.39
## - language 28 103.825 3231.9 -300.76
## - country 39 141.335 3269.4 -280.57
## - content_rating 11 194.427 3322.5 -165.68
## - gross 1 300.436 3428.5 -30.85
##
## Step: AIC=-364.18
## imdb_score ~ country + content_rating + color + language + gross +
## dicaprio + hanks + eastwood + hardy + rickman + cumberbatch +
## bale + hoffman + ford + spacy + nolan + tarantino + fincher +
## scorsese + anderson + Greengrass
##
## Df Sum of Sq RSS AIC
## <none> 3129.8 -364.18
## + jackson 1 1.672 3128.1 -364.13
## - hardy 1 2.013 3131.8 -363.83
## + cameron 1 1.146 3128.6 -363.52
## - cumberbatch 1 3.016 3132.8 -362.66
## - dicaprio 1 3.192 3133.0 -362.45
## - rickman 1 4.141 3133.9 -361.35
## - Greengrass 1 4.416 3134.2 -361.02
## - bale 1 5.013 3134.8 -360.33
## + facenumber_in_poster 18 26.961 3102.8 -359.81
## + title_year 73 118.613 3011.2 -359.43
## - nolan 1 8.157 3137.9 -356.66
## - ford 1 8.279 3138.0 -356.52
## - eastwood 1 8.533 3138.3 -356.23
## - hoffman 1 9.593 3139.4 -354.99
## - spacy 1 9.806 3139.6 -354.74
## - anderson 1 12.060 3141.8 -352.12
## - fincher 1 12.939 3142.7 -351.10
## - tarantino 1 14.139 3143.9 -349.70
## - hanks 1 14.944 3144.7 -348.76
## - scorsese 1 15.306 3145.1 -348.34
## - color 1 35.069 3164.8 -325.44
## - language 28 103.929 3233.7 -300.75
## - country 39 146.907 3276.7 -274.48
## - content_rating 11 194.540 3324.3 -165.71
## - gross 1 310.242 3440.0 -20.63
# Show resultsd
score_step$Anova
## NULL
Since the original model had a lower AIC than the stepped models, we chose to use this model
Build final regression model
# Create multiple linear regression model
final_linear <- lm(imdb_score ~ country + content_rating + color + title_year +
language + facenumber_in_poster + gross + dicaprio + hanks +
eastwood + hardy + rickman + cumberbatch + bale + hoffman +
driver + ford + spacy + nolan + tarantino + cameron + jackson +
fincher + scorsese + anderson + Greengrass, data = movie_df)
# Show the summary of the multiple linear regression model
summary(final_linear)
##
## Call:
## lm(formula = imdb_score ~ country + content_rating + color +
## title_year + language + facenumber_in_poster + gross + dicaprio +
## hanks + eastwood + hardy + rickman + cumberbatch + bale +
## hoffman + driver + ford + spacy + nolan + tarantino + cameron +
## jackson + fincher + scorsese + anderson + Greengrass, data = movie_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5274 -0.4954 0.0497 0.6160 2.6719
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.329e+00 1.873e+00 3.913 9.27e-05 ***
## countryArgentina 9.373e-01 1.451e+00 0.646 0.518487
## countryAruba -1.814e+00 1.615e+00 -1.123 0.261410
## countryAustralia -9.442e-02 1.327e+00 -0.071 0.943269
## countryBelgium 6.190e-01 1.613e+00 0.384 0.701205
## countryBrazil 5.535e-01 1.535e+00 0.361 0.718493
## countryCanada -3.129e-01 1.323e+00 -0.236 0.813067
## countryChile 9.237e-01 1.624e+00 0.569 0.569505
## countryChina 7.194e-01 1.419e+00 0.507 0.612087
## countryColombia 7.585e-01 1.638e+00 0.463 0.643421
## countryCzech Republic -1.013e-02 1.482e+00 -0.007 0.994548
## countryDenmark 4.129e-01 1.382e+00 0.299 0.765148
## countryFinland -1.935e-01 1.636e+00 -0.118 0.905890
## countryFrance 5.648e-02 1.323e+00 0.043 0.965937
## countryGeorgia -5.882e-01 1.640e+00 -0.359 0.719836
## countryGermany -3.591e-01 1.323e+00 -0.271 0.786045
## countryGreece 5.631e-01 1.617e+00 0.348 0.727753
## countryHong Kong 1.430e-01 1.385e+00 0.103 0.917781
## countryHungary -1.759e-01 1.615e+00 -0.109 0.913259
## countryIceland 4.918e-01 1.614e+00 0.305 0.760562
## countryIndia 2.898e-01 1.535e+00 0.189 0.850237
## countryIndonesia -3.502e-01 1.875e+00 -0.187 0.851875
## countryIran 1.162e-01 1.656e+00 0.070 0.944059
## countryIreland 6.137e-01 1.364e+00 0.450 0.652820
## countryIsrael 8.487e-01 1.746e+00 0.486 0.626912
## countryItaly 7.221e-01 1.375e+00 0.525 0.599457
## countryJapan -1.104e+00 1.367e+00 -0.807 0.419490
## countryMexico 6.293e-01 1.400e+00 0.449 0.653118
## countryNetherlands 3.862e-01 1.571e+00 0.246 0.805767
## countryNew Line -2.038e+00 1.614e+00 -1.263 0.206727
## countryNew Zealand 6.024e-01 1.364e+00 0.442 0.658778
## countryNorway 1.695e-01 1.549e+00 0.109 0.912873
## countryOfficial site 3.154e-01 1.615e+00 0.195 0.845211
## countryPeru -1.487e+00 1.618e+00 -0.919 0.358114
## countryPoland -9.567e-01 1.614e+00 -0.593 0.553345
## countryRomania -1.717e+00 1.614e+00 -1.064 0.287485
## countryRussia -9.153e-01 1.623e+00 -0.564 0.572797
## countrySouth Africa 7.301e-01 1.424e+00 0.513 0.608140
## countrySouth Korea -1.455e+00 1.427e+00 -1.020 0.307775
## countrySpain 5.936e-01 1.337e+00 0.444 0.657102
## countryTaiwan 7.615e-01 1.572e+00 0.484 0.628200
## countryThailand -6.401e-01 1.618e+00 -0.396 0.692317
## countryUK 3.071e-01 1.319e+00 0.233 0.815908
## countryUSA -1.877e-01 1.318e+00 -0.142 0.886733
## countryWest Germany 3.251e-01 1.667e+00 0.195 0.845358
## content_ratingG 2.861e-01 5.605e-01 0.511 0.609699
## content_ratingGP -7.171e-03 1.266e+00 -0.006 0.995482
## content_ratingM 1.777e+00 1.277e+00 1.392 0.164027
## content_ratingNC-17 3.896e-01 6.823e-01 0.571 0.568028
## content_ratingNot Rated 6.671e-01 5.896e-01 1.131 0.257967
## content_ratingPassed 5.286e-01 1.431e+00 0.369 0.711779
## content_ratingPG 2.355e-01 5.563e-01 0.423 0.672007
## content_ratingPG-13 3.259e-01 5.559e-01 0.586 0.557736
## content_ratingR 7.662e-01 5.556e-01 1.379 0.167992
## content_ratingUnrated 7.639e-01 5.842e-01 1.308 0.191104
## content_ratingX 3.307e-01 6.522e-01 0.507 0.612152
## color1 -5.013e-01 9.409e-02 -5.328 1.06e-07 ***
## title_year1929 -5.364e-01 1.907e+00 -0.281 0.778519
## title_year1933 5.161e-01 1.385e+00 0.373 0.709455
## title_year1935 1.377e+00 1.477e+00 0.932 0.351306
## title_year1936 1.838e+00 1.372e+00 1.340 0.180316
## title_year1937 9.155e-01 1.478e+00 0.619 0.535820
## title_year1939 1.069e+00 1.372e+00 0.779 0.435891
## title_year1940 1.166e+00 1.479e+00 0.788 0.430663
## title_year1946 5.678e-01 1.194e+00 0.476 0.634415
## title_year1947 5.578e-01 1.354e+00 0.412 0.680322
## title_year1948 1.122e+00 1.479e+00 0.759 0.448022
## title_year1950 4.715e-01 1.904e+00 0.248 0.804455
## title_year1952 -8.973e-02 1.356e+00 -0.066 0.947227
## title_year1953 5.829e-02 1.237e+00 0.047 0.962432
## title_year1954 6.804e-01 1.203e+00 0.566 0.571741
## title_year1957 1.442e+00 1.371e+00 1.052 0.293083
## title_year1959 1.113e+00 1.355e+00 0.822 0.411332
## title_year1960 1.184e+00 1.369e+00 0.864 0.387428
## title_year1961 6.800e-01 1.385e+00 0.491 0.623475
## title_year1962 1.379e+00 1.239e+00 1.113 0.265727
## title_year1963 9.024e-01 1.270e+00 0.711 0.477350
## title_year1964 1.051e+00 1.189e+00 0.884 0.376774
## title_year1965 7.887e-01 1.113e+00 0.709 0.478631
## title_year1966 2.027e+00 1.602e+00 1.265 0.205877
## title_year1967 2.514e-01 1.479e+00 0.170 0.865035
## title_year1968 9.886e-01 1.206e+00 0.820 0.412514
## title_year1969 -7.901e-01 1.380e+00 -0.572 0.567098
## title_year1970 8.578e-01 1.117e+00 0.768 0.442684
## title_year1971 1.117e-01 1.202e+00 0.093 0.925967
## title_year1972 8.350e-01 1.218e+00 0.686 0.492969
## title_year1973 3.900e-01 1.090e+00 0.358 0.720669
## title_year1974 6.429e-01 1.075e+00 0.598 0.549741
## title_year1975 1.246e+00 1.141e+00 1.092 0.274881
## title_year1976 9.334e-01 1.201e+00 0.777 0.437231
## title_year1977 3.704e-01 1.076e+00 0.344 0.730717
## title_year1978 -8.631e-03 1.076e+00 -0.008 0.993599
## title_year1979 3.250e-01 1.077e+00 0.302 0.762879
## title_year1980 7.586e-02 1.039e+00 0.073 0.941797
## title_year1981 -2.230e-01 1.033e+00 -0.216 0.829124
## title_year1982 -1.604e-01 1.037e+00 -0.155 0.877057
## title_year1983 7.472e-02 1.039e+00 0.072 0.942673
## title_year1984 1.531e-02 1.028e+00 0.015 0.988115
## title_year1985 8.615e-03 1.035e+00 0.008 0.993356
## title_year1986 -4.902e-01 1.023e+00 -0.479 0.631770
## title_year1987 -4.195e-01 1.021e+00 -0.411 0.681065
## title_year1988 -3.083e-01 1.021e+00 -0.302 0.762689
## title_year1989 -9.991e-03 1.020e+00 -0.010 0.992181
## title_year1990 -3.821e-02 1.023e+00 -0.037 0.970191
## title_year1991 -3.431e-01 1.020e+00 -0.336 0.736635
## title_year1992 -1.380e-01 1.019e+00 -0.135 0.892302
## title_year1993 -1.108e-01 1.015e+00 -0.109 0.913079
## title_year1994 -2.272e-01 1.014e+00 -0.224 0.822799
## title_year1995 -3.471e-01 1.012e+00 -0.343 0.731743
## title_year1996 -3.868e-01 1.011e+00 -0.383 0.701945
## title_year1997 -4.127e-01 1.011e+00 -0.408 0.683055
## title_year1998 -3.492e-01 1.009e+00 -0.346 0.729221
## title_year1999 -4.136e-01 1.009e+00 -0.410 0.681814
## title_year2000 -5.440e-01 1.008e+00 -0.540 0.589387
## title_year2001 -5.101e-01 1.008e+00 -0.506 0.612984
## title_year2002 -5.422e-01 1.008e+00 -0.538 0.590776
## title_year2003 -6.060e-01 1.008e+00 -0.601 0.547590
## title_year2004 -3.764e-01 1.007e+00 -0.374 0.708599
## title_year2005 -3.496e-01 1.008e+00 -0.347 0.728828
## title_year2006 -4.260e-01 1.007e+00 -0.423 0.672440
## title_year2007 -2.737e-01 1.008e+00 -0.271 0.786070
## title_year2008 -4.523e-01 1.007e+00 -0.449 0.653411
## title_year2009 -4.545e-01 1.007e+00 -0.451 0.651876
## title_year2010 -4.426e-01 1.008e+00 -0.439 0.660724
## title_year2011 -4.526e-01 1.008e+00 -0.449 0.653483
## title_year2012 -4.039e-01 1.008e+00 -0.401 0.688630
## title_year2013 -2.808e-01 1.009e+00 -0.278 0.780738
## title_year2014 -3.879e-01 1.009e+00 -0.384 0.700711
## title_year2015 -4.010e-01 1.010e+00 -0.397 0.691268
## title_year2016 -3.577e-01 1.014e+00 -0.353 0.724349
## languageArabic 3.497e-01 1.150e+00 0.304 0.761112
## languageAramaic 6.829e-02 1.144e+00 0.060 0.952421
## languageBosnian -2.655e+00 1.145e+00 -2.319 0.020442 *
## languageCantonese -5.347e-02 8.634e-01 -0.062 0.950619
## languageCzech 2.309e-01 1.328e+00 0.174 0.862022
## languageDanish 3.159e-01 9.511e-01 0.332 0.739803
## languageDari 8.389e-01 1.145e+00 0.733 0.463624
## languageDutch NA NA NA NA
## languageEnglish -6.749e-01 6.639e-01 -1.017 0.309428
## languageFilipino -9.022e-01 1.146e+00 -0.787 0.431175
## languageFrench 3.399e-01 6.919e-01 0.491 0.623233
## languageGerman 6.542e-01 7.513e-01 0.871 0.383950
## languageHebrew NA NA NA NA
## languageHindi NA NA NA NA
## languageHungarian 3.107e-02 1.475e+00 0.021 0.983199
## languageIndonesian 7.912e-01 1.159e+00 0.683 0.494705
## languageItalian -1.211e+00 8.404e-01 -1.441 0.149624
## languageJapanese 1.545e+00 8.007e-01 1.930 0.053701 .
## languageKazakh -1.301e+00 1.147e+00 -1.134 0.256923
## languageKorean 2.010e+00 9.536e-01 2.108 0.035139 *
## languageMandarin -7.189e-01 8.557e-01 -0.840 0.400936
## languageMaya 5.971e-01 1.144e+00 0.522 0.601845
## languageMongolian 8.705e-01 1.485e+00 0.586 0.557752
## languageNone 1.976e+00 1.144e+00 1.726 0.084373 .
## languageNorwegian NA NA NA NA
## languagePersian 1.256e+00 1.320e+00 0.952 0.341375
## languagePortuguese NA NA NA NA
## languageRomanian 2.391e+00 1.487e+00 1.608 0.107883
## languageRussian 1.916e-01 1.485e+00 0.129 0.897343
## languageSpanish -5.043e-01 7.224e-01 -0.698 0.485193
## languageThai 1.194e-01 1.268e+00 0.094 0.925025
## languageVietnamese 4.734e-01 1.145e+00 0.414 0.679227
## languageZulu -2.728e-01 1.141e+00 -0.239 0.811075
## facenumber_in_poster1 8.938e-03 3.983e-02 0.224 0.822459
## facenumber_in_poster2 -5.615e-02 4.866e-02 -1.154 0.248672
## facenumber_in_poster3 -6.115e-02 6.136e-02 -0.997 0.319037
## facenumber_in_poster4 -1.397e-01 7.994e-02 -1.748 0.080556 .
## facenumber_in_poster5 -1.906e-01 1.125e-01 -1.695 0.090160 .
## facenumber_in_poster6 -1.151e-01 1.289e-01 -0.894 0.371581
## facenumber_in_poster7 -1.545e-01 1.787e-01 -0.864 0.387467
## facenumber_in_poster8 -1.706e-01 1.730e-01 -0.986 0.324047
## facenumber_in_poster9 -2.779e-01 2.958e-01 -0.939 0.347589
## facenumber_in_poster10 -4.249e-01 3.818e-01 -1.113 0.265918
## facenumber_in_poster11 -1.663e+00 4.176e-01 -3.983 6.95e-05 ***
## facenumber_in_poster12 -2.029e-01 5.381e-01 -0.377 0.706203
## facenumber_in_poster13 -3.049e-01 9.328e-01 -0.327 0.743810
## facenumber_in_poster14 1.053e-01 1.008e+00 0.104 0.916808
## facenumber_in_poster15 1.779e-01 4.664e-01 0.381 0.702859
## facenumber_in_poster19 -1.049e+00 9.319e-01 -1.126 0.260432
## facenumber_in_poster31 7.011e-01 9.317e-01 0.752 0.451805
## facenumber_in_poster43 1.722e+00 9.314e-01 1.849 0.064605 .
## gross 4.385e-09 2.500e-10 17.540 < 2e-16 ***
## dicaprio 4.760e-01 2.218e-01 2.146 0.031911 *
## hanks 8.224e-01 1.937e-01 4.246 2.23e-05 ***
## eastwood 5.695e-01 2.923e-01 1.949 0.051426 .
## hardy 5.444e-01 3.241e-01 1.680 0.093133 .
## rickman 7.105e-01 3.328e-01 2.135 0.032817 *
## cumberbatch 5.934e-01 3.543e-01 1.675 0.094046 .
## bale 5.215e-01 2.108e-01 2.474 0.013418 *
## hoffman 7.741e-01 2.272e-01 3.407 0.000664 ***
## driver NA NA NA NA
## ford 3.850e-01 1.919e-01 2.006 0.044978 *
## spacy 7.366e-01 2.001e-01 3.680 0.000236 ***
## nolan 1.086e+00 3.449e-01 3.149 0.001655 **
## tarantino 1.339e+00 3.330e-01 4.021 5.91e-05 ***
## cameron 2.851e-01 3.622e-01 0.787 0.431173
## jackson 6.078e-01 3.770e-01 1.612 0.107022
## fincher 1.124e+00 2.956e-01 3.802 0.000146 ***
## scorsese 9.034e-01 2.469e-01 3.660 0.000256 ***
## anderson 1.347e+00 3.532e-01 3.814 0.000139 ***
## Greengrass 7.659e-01 3.542e-01 2.162 0.030649 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9281 on 3461 degrees of freedom
## Multiple R-squared: 0.271, Adjusted R-squared: 0.2302
## F-statistic: 6.633 on 194 and 3461 DF, p-value: < 2.2e-16
Remove MASS package from the environment as it affects the dplyr package
detach("package:MASS", unload = F)
Load caret Package
library(caret)
Create 10-Fold validation configuration of control
fit_control <- trainControl(method = "cv", number = 10)
Train Multiple Linear Regression model to predict imdb_score
# Set seed to obtain same results each time
set.seed(1234)
# Train linear model
multi_fit <- train(imdb_score ~ country + content_rating + color + title_year +
language + facenumber_in_poster + gross + dicaprio + hanks +
eastwood + hardy + rickman + cumberbatch + bale + hoffman +
driver + ford + spacy + nolan + tarantino + cameron + jackson +
fincher + scorsese + anderson + Greengrass,
data = movie_df,
trControl = fit_control,
method = "lm")
# Print results
print(multi_fit)
## Linear Regression
##
## 3656 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3291, 3291, 3291, 3291, 3290, 3291, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.9511951 0.1993298 0.7377693
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Train Gradient Boosted Machine Method to predict imdb_score
set.seed(1234)
gbm_fit <- train(imdb_score ~ country + content_rating + color + title_year +
language + facenumber_in_poster + gross + dicaprio + hanks +
eastwood + hardy + rickman + cumberbatch + bale + hoffman +
driver + ford + spacy + nolan + tarantino + cameron + jackson +
fincher + scorsese + anderson + Greengrass,
data = movie_df,
trControl = fit_control,
method = "gbm",
verbose = F)
print(gbm_fit)
## Stochastic Gradient Boosting
##
## 3656 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3291, 3291, 3291, 3291, 3290, 3291, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.9815368 0.1611215 0.7606611
## 1 100 0.9664423 0.1741769 0.7485170
## 1 150 0.9580304 0.1845483 0.7411589
## 2 50 0.9656447 0.1743920 0.7473884
## 2 100 0.9561040 0.1851505 0.7390235
## 2 150 0.9522526 0.1906169 0.7348484
## 3 50 0.9601786 0.1813499 0.7418423
## 3 100 0.9535586 0.1881617 0.7358058
## 3 150 0.9542481 0.1871194 0.7357722
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
Train Support Machine Vector to predict imdb_score
svm_fit <- train(imdb_score ~ country + content_rating + color + title_year +
language + facenumber_in_poster + gross + dicaprio + hanks +
eastwood + hardy + rickman + cumberbatch + bale + hoffman +
driver + ford + spacy + nolan + tarantino + cameron + jackson +
fincher + scorsese + anderson + Greengrass,
data = movie_df,
trControl = fit_control,
method = "svmRadial",
verbose = F)
print(svm_fit)
## Support Vector Machines with Radial Basis Function Kernel
##
## 3656 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3291, 3291, 3290, 3289, 3290, 3290, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.060589 0.002779784 0.8141206
## 0.50 1.060359 0.002658155 0.8141450
## 1.00 1.059011 0.002070716 0.8143338
##
## Tuning parameter 'sigma' was held constant at a value of 2.26418e-14
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 2.26418e-14 and C = 1.
Train Decision Tree to predict imdb_score
dtree_fit <- train(imdb_score ~ country + content_rating + color + title_year +
language + facenumber_in_poster + gross + dicaprio + hanks +
eastwood + hardy + rickman + cumberbatch + bale + hoffman +
driver + ford + spacy + nolan + tarantino + cameron + jackson +
fincher + scorsese + anderson + Greengrass,
data = movie_df,
method = "rpart",
trControl = fit_control,
tuneLength = 20,
parms=list(split='information'))
print(dtree_fit)
## CART
##
## 3656 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3291, 3290, 3290, 3291, 3291, 3290, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.001738798 0.9907022 0.13278909 0.7630285
## 0.002068381 0.9820585 0.14275199 0.7585432
## 0.002131491 0.9806695 0.14487365 0.7577790
## 0.002342080 0.9768093 0.14983026 0.7539785
## 0.002369088 0.9768093 0.14983026 0.7539785
## 0.002390306 0.9764477 0.15020232 0.7540550
## 0.002431604 0.9764581 0.15019392 0.7540253
## 0.002551606 0.9755093 0.15163191 0.7535363
## 0.003467947 0.9743540 0.15318537 0.7546189
## 0.003775077 0.9742853 0.15326460 0.7545389
## 0.004995151 0.9786091 0.14613286 0.7595301
## 0.005116126 0.9791663 0.14516994 0.7602984
## 0.005492177 0.9803517 0.14313246 0.7611767
## 0.006169864 0.9825830 0.13899321 0.7634161
## 0.009904726 0.9871068 0.13034482 0.7648497
## 0.010872825 0.9894890 0.12620686 0.7667665
## 0.013958926 0.9996821 0.10838720 0.7748752
## 0.032847341 1.0160793 0.07906620 0.7891151
## 0.035907783 1.0288863 0.05627205 0.7984240
## 0.041406660 1.0515144 0.02531107 0.8166404
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.003775077.
# Output the final model for the decision tree
rpart.plot::rpart.plot(dtree_fit$finalModel)
Use simple splitting to predict imdb_score
# Create predictive df with columns used for other models
predictive_df <- movie_df %>% select(imdb_score, country, content_rating, color, title_year,
language, facenumber_in_poster, gross, dicaprio, hanks,
eastwood, hardy, rickman, cumberbatch, bale, hoffman,
driver, ford, spacy, nolan, tarantino, cameron, jackson,
fincher, scorsese, anderson, Greengrass)
# This was done to simplify subsutting of the df for train_data and test_data
set.seed(1234)
split_fit <- createDataPartition(predictive_df$imdb_score, p = .8, list = F)
train_data <- predictive_df[split_fit,]
test_data <- predictive_df[-split_fit,]
m1 <- lm(imdb_score~., data = test_data)
# Make predictions
x_test <- test_data[,-1]
y_test <- test_data[,1]
predictions <- predict(multi_fit, test_data)
# Summarize results
simple_split <- postResample(predictions, y_test)
simple_split
## RMSE Rsquared MAE
## 0.9128267 0.2776292 0.6935228
Resample and show results to find the best model
results <- resamples(list(linear = multi_fit, GBM = gbm_fit, SVM = svm_fit, rpart = dtree_fit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: linear, GBM, SVM, rpart
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## linear 0.6721460 0.7182493 0.7450640 0.7377693 0.7696409 0.7801051 0
## GBM 0.6852720 0.7185370 0.7418544 0.7348484 0.7505480 0.7731188 0
## SVM 0.7855687 0.7997161 0.8086314 0.8143338 0.8321575 0.8438801 0
## rpart 0.7133329 0.7454524 0.7529411 0.7545389 0.7681900 0.8044121 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## linear 0.8577889 0.9273996 0.9568565 0.9511951 0.9912828 1.015070 0
## GBM 0.8641010 0.9306892 0.9549358 0.9522526 0.9836367 1.011582 0
## SVM 1.0061206 1.0242484 1.0477408 1.0590112 1.0956613 1.121755 0
## rpart 0.8992071 0.9548304 0.9794503 0.9742853 0.9970306 1.026799 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## linear 1.073202e-01 0.1763706475 0.2036895384 0.199329762 0.23621639
## GBM 1.301923e-01 0.1642848221 0.2090832055 0.190616876 0.21689205
## SVM 1.043988e-05 0.0001405843 0.0009386559 0.002070716 0.00307456
## rpart 1.034697e-01 0.1475240234 0.1562923170 0.153264596 0.16885243
## Max. NA's
## linear 0.253053285 0
## GBM 0.226620546 0
## SVM 0.009224642 1
## rpart 0.192979345 0
The Linear Model with a 10-fold cross validation has the best performance. It was the lowest RMSE and the highest R2.
According to the linear model, there is a significantly negative effect of having a film in color. This may relate to how newer films are now in color. When you consider that films have slightly lowered in their average IMDb score over time, it may not be ideal to recommend that film producers shoot their films in grayscale if they want to improve performance.
The decision tree can give better actionable steps to improve the IMDb rating for a film to be released in the future. In general, movies that are released in the UK, aren’t produced in color, are rated R, and aren’t in English tend to have higher IMDb scores. Movies with a higher gross tend to have the highest IMDb scores. There are certain actors and directors who consistently perform well and are featured in films with high ratings. It may be recommended to consider any of these factors for a film producer who wants to improve their movie’s critical response.